mlatoz

Upper Confidence Bound (UCB)

The Multi-Armed Bandit Problem - Summary

We have d arms. For example, the arms are ads that we display to users each time they connect to a web page.
Each time a user connects to this web page counts as one round.
At each round n, we choose one ad to display to the user.
At each round n, ad i gives reward r_i(n) ∈ {0, 1}: r_i(n) = 1 if the user clicked on ad i, and 0 otherwise.
Our goal is to maximize the total reward we get over many rounds.
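
To make the setup concrete, here is a minimal sketch of a simulated ad environment. The click probabilities, the function names (`simulate_clicks`, `total_reward`) and the random baseline are assumptions made for illustration; they are not part of the original summary.

```python
import random

def simulate_clicks(click_probs, n_rounds, seed=0):
    """Build a table rewards[n][i] in {0, 1}: 1 if ad i would be clicked at round n."""
    rng = random.Random(seed)
    return [[1 if rng.random() < p else 0 for p in click_probs]
            for _ in range(n_rounds)]

def total_reward(rewards, choose):
    """Sum the rewards collected when choose(n) returns the ad shown at round n."""
    return sum(row[choose(n)] for n, row in enumerate(rewards))

# Hypothetical click-through rates for d = 3 ads (unknown to the algorithm).
click_probs = [0.05, 0.10, 0.30]
rewards = simulate_clicks(click_probs, n_rounds=1000)

# Baseline 1: show a uniformly random ad at every round.
baseline_rng = random.Random(1)
random_total = total_reward(rewards, lambda n: baseline_rng.randrange(len(click_probs)))

# Baseline 2: the best single ad in hindsight (not achievable without knowing the rates).
best_fixed_total = max(sum(row[i] for row in rewards) for i in range(len(click_probs)))

print("random selection:", random_total)
print("best fixed ad:   ", best_fixed_total)
```

A good selection strategy should land between these two totals, and as close to the second as possible.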

Upper Confidence Bound Algorithm

Step 1: At each round n, we consider two numbers for each ad i:

* N_i(n) - the number of times ad i was selected up to round n,
* R_i(n) - the sum of rewards of ad i up to round n.

Step 2: From these two numbers we compute:

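* the average reward of ad i up to round n (written r̄_i(n) to distinguish it from the single-round reward r_i(n)):
  r̄_i(n) = R_i(n) / N_i(n)
* the confidence interval [r̄_i(n) - Δ_i(n), r̄_i(n) + Δ_i(n)] at round n, with
  Δ_i(n) = √( (3/2) · log(n) / N_i(n) )

Both quantities are defined only once ad i has been selected at least once, that is, N_i(n) > 0.

Step 3: We select the ad i that has the maximum UCB, r̄_i(n) + Δ_i(n).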
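
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names `ucb_select` and `run_ucb` are hypothetical, the logarithm is taken to be the natural log, the click probabilities are made up, and every ad is shown once before the formula is used so that N_i(n) is never zero (the summary does not say how to handle that case).

```python
import math
import random

def ucb_select(counts, sums, n):
    """Return the index of the ad with the largest upper confidence bound at round n (1-based)."""
    best_arm, best_bound = 0, -math.inf
    for i, (n_i, r_i) in enumerate(zip(counts, sums)):
        if n_i == 0:
            return i  # assumption: try every ad once before applying the formula
        mean = r_i / n_i                             # average reward of ad i
        delta = math.sqrt(1.5 * math.log(n) / n_i)   # half-width of the confidence interval
        if mean + delta > best_bound:
            best_arm, best_bound = i, mean + delta
    return best_arm

def run_ucb(click_probs, n_rounds, seed=0):
    """Simulate n_rounds of UCB against ads with the given (hypothetical) click rates."""
    rng = random.Random(seed)
    counts = [0] * len(click_probs)  # N_i(n): times ad i was selected
    sums = [0] * len(click_probs)    # R_i(n): total reward of ad i
    for n in range(1, n_rounds + 1):
        i = ucb_select(counts, sums, n)                       # Steps 2 and 3
        reward = 1 if rng.random() < click_probs[i] else 0    # simulated click
        counts[i] += 1                                        # Step 1 bookkeeping
        sums[i] += reward
    return counts, sums

counts, sums = run_ucb([0.05, 0.10, 0.30], n_rounds=5000)
print("selections per ad:", counts)
print("total reward:", sum(sums))
```

Because Δ_i(n) shrinks as N_i(n) grows, ads that have been shown often are judged mostly on their average reward, while rarely shown ads keep a wide interval and are still explored from time to time.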
